Path identification for network data

ABSTRACT

A solution is provided wherein a master process and two or more drone processes may be utilized to identify path information containing a pattern. The master process may send the pattern to the two or more drone processes, which may identify the pattern in path data. Each drone process may then send the paths that satisfy the pattern back to the master process, which may aggregate the path data so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path.

RELATED APPLICATION

This application is related to U.S. patent application Ser. No. ______, entitled “PATH INDEXING FOR NETWORK DATA” (Attorney Docket No. YAH1P055), filed concurrently herewith by Jagdish Chand, Suresh Antony, Rajesh Bhargava, Avanti Nadgir, and Jagannatha Narayanareddy.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to network usage data. More particularly, the present invention relates to path identification for network data.

2. Description of the Related Art

The process of analyzing Internet-based actions such as web surfing patterns is known as web analytics. One part of web analytics is understanding how user traffic flows through a network (also known as user paths). This typically involves analyzing which nodes a user encounters when accessing a particular network. In large networks such as, for example, large search engine/directories, billions of pageviews may be generated per day. As such, analyzing this huge amount of data can be daunting. Such analysis is needed, however, to determine common user behavior in order to optimize the network for better user engagement and network integration.

Due to the plentiful nature of this network data, however, performing analysis can be time-consuming. Even the identification of useful patterns can take hours or days, amounts of time that are unacceptable to most of the people interested in finding the patterns (e.g., managers, CEOs, etc.). As such, what is needed is a faster way to identify useful patterns in such a large data set.

SUMMARY OF THE INVENTION

A solution is provided wherein a master process and two or more drone processes may be utilized to identify path information containing a pattern. The master process may send the pattern to the two or more drone processes, which may identify the pattern in path data. Each drone process may then send the paths that satisfy the pattern back to the master process, which may aggregate the path data so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating the structure of the files in accordance with an embodiment of the present invention.

FIG. 2 is a diagram illustrating an architecture of an indexing engine in accordance with an embodiment of the present invention.

FIG. 3 is a diagram illustrating a path file, node path index file, and node index file for the first bucket in the above example.

FIG. 4 is a diagram illustrating an architecture for the efficient identification of patterns in path data in accordance with an embodiment of the present invention.

FIG. 5 is a diagram illustrating an example of how patterns are extracted using a drone in accordance with an embodiment of the present invention.

FIG. 6 is a flow diagram illustrating a method for identifying path information containing a pattern in accordance with an embodiment of the present invention.

FIG. 7 is a flow diagram illustrating a method for identifying path information containing a pattern in accordance with another embodiment of the present invention.

FIG. 8 is a flow diagram illustrating 702 of FIG. 7 in more detail.

FIG. 9 is a block diagram illustrating an apparatus for identifying path information containing a pattern in accordance with an embodiment of the present invention.

FIG. 10 is a block diagram illustrating an apparatus for identifying path information containing a pattern in accordance with another embodiment of the present invention.

FIG. 11 is a block diagram illustrating 1002 of FIG. 10 in more detail.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Reference will now be made in detail to specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In addition, well-known features may not have been described in detail to avoid unnecessarily obscuring the invention.

Common business questions that need to be answered by analyzing a large network user path data set include:

1. What are the top paths traversed from a particular node to another particular node? (e.g., what paths did users commonly follow to go from Yahoo! Finance to Yahoo! Sports).

2. What are the top paths traversed from a particular node to another particular node that encompass certain paths? (e.g., what paths did users commonly follow to go from Yahoo! Finance to Yahoo! Sports that included passing through Yahoo! Entertainment first).

3. What are the top paths traversed from a particular node? (e.g., what paths did users commonly follow after Yahoo! Finance).

4. What are the top nodes users left off at without reaching a destination node (starting at some node followed by a sequence of nodes)?

5. What are the top referrers for a given sequence of nodes?

6. What are the nodes that have a maximum affinity to a given node?

The beginning point for various embodiments of the present invention may be a data set of visited paths. This path information may be generated by any number of mechanisms. In an embodiment of the present invention, the paths in the data set may first be evenly split into multiple buckets. A bucket is simply an abstract organizational construct connoting a grouping of information. This allows each of the buckets to be processed in parallel by one or more computers and/or processors. It should be noted that each of the buckets will typically wind up containing all the nodes in the domain set in that paths are not deliberately ordered into specific buckets. However, no limitations are placed on the possibilities for various groupings, including groupings that are made for other purposes beyond the scope of the disclosure, such as grouping certain users, geographic regions, etc. together.

Network path information related to each of the buckets may be organized into three files: a node index file, a node path index file, and a path file. In one embodiment of the present invention these files may be in a binary format. FIG. 1 is a diagram illustrating the structure of the files in accordance with an embodiment of the present invention. Each bucket may contain one of each of these three files. The path file 100 may contain the raw path information from the data set (for the paths placed in this particular bucket). The path file may have one entry 102 for each path. Each entry may include the path itself 104 (expressed, for example, as an ordered list of nodes), information about the length of the path 106, the frequency with which the path occurred 108 (in the data corresponding to the particular bucket), and an offset 110. The offset may represent the location within the file where the entry is present (i.e., the number of entries in the file preceding the current entry). For example, if the entry 102 is the 20th entry in the file, the offset may be 19.

The node path index file 112 may contain an entry for each occurrence of a node in all the paths associated with the bucket. Each entry may carry information about that node in the corresponding path file 100. It may contain the position 114 of the node in the path and an offset 116 into the path file 100 to directly access the information about the path. This offset may also be thought of as a pointer to a particular area of the path file 100 that contains the information about the path.

The node index file 118 may contain one entry for each node that is present in the paths (i.e., a single entry for the node even if the node is present in multiple paths). An entry may also be present for a node even if the node is not present in the corresponding bucket. Each entry 120 may contain a count 122 reflecting the number of entries in the node path index file 112 for the given node. Each entry 120 may also contain an offset 124 pointing to the first entry for the node in the node path index file 112.

Given these three files, data may be accessed very quickly as only the information that is relevant is read by directly navigating to that location in the index files. For example, to obtain all the different paths users have navigated after visiting a Node N, the following method may be performed. First, the node index file 118 may be accessed to determine where the Node N is present. Once this entry is found, the offset 124 may be obtained for this node and the number of entries to be scanned may be obtained by the count 122. Then, using the offset 124, the specific entry in the node path index file 112 may be located. Starting from this entry, a number of entries equal to the retrieved count 122 may be selected. For each of these selected entries, the offsets 116 may be used to identify and extract the corresponding paths in the path file 100.
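By way of illustration only, the lookup just described may be sketched in Python as follows. This is a minimal in-memory sketch, not the binary format of any embodiment: the names PathEntry, NodePathEntry, NodeEntry, build_indexes, and paths_after are hypothetical, offsets are taken to be entry numbers, positions are 1-based, and ordering the node path index by ascending node identifier is an assumption.

```python
from collections import namedtuple

# Hypothetical in-memory stand-ins for the three files.
PathEntry = namedtuple("PathEntry", "offset length frequency nodes")
NodePathEntry = namedtuple("NodePathEntry", "path_offset position")
NodeEntry = namedtuple("NodeEntry", "count first")


def build_indexes(paths):
    """Build (path_file, node_path_index, node_index) from (nodes, frequency) pairs."""
    path_file = [PathEntry(i, len(nodes), freq, tuple(nodes))
                 for i, (nodes, freq) in enumerate(paths)]
    occurrences = {}
    for entry in path_file:
        for position, node in enumerate(entry.nodes, start=1):
            occurrences.setdefault(node, []).append(
                NodePathEntry(entry.offset, position))
    node_path_index, node_index = [], {}
    for node in sorted(occurrences):  # assumed ordering: ascending node id
        node_index[node] = NodeEntry(len(occurrences[node]), len(node_path_index))
        node_path_index.extend(occurrences[node])
    return path_file, node_path_index, node_index


def paths_after(node, path_file, node_path_index, node_index):
    """Return (remaining nodes, frequency) for every occurrence of node."""
    entry = node_index.get(node)
    if entry is None:
        return []
    results = []
    # Read exactly `count` records starting at the node's first record.
    for record in node_path_index[entry.first:entry.first + entry.count]:
        path = path_file[record.path_offset]
        # Positions are 1-based, so slicing at the position skips the node itself.
        results.append((path.nodes[record.position:], path.frequency))
    return results
```

Under these assumptions, only the records belonging to the requested node are touched; the rest of the node path index and the path file are never scanned.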

It should be noted that the use of buckets is optional. Certain implementations are envisioned wherein there are no buckets and the path file 100 contains all of the path information for the entire data set. The same may be said for the node path index file 112 and the node index file 118.

FIG. 2 is a diagram illustrating an architecture of an indexing engine in accordance with an embodiment of the present invention. Aggregated raw path data 200 and the corresponding frequencies may be passed to an indexing engine 202. The indexing engine 202 may include a path index generator 204 and a node index generator 206. The path index generator may be called for each of the individual buckets to generate a path file 208. This may include writing a binary record for each path, the record containing an offset at which it is written, as well as the length of the path and the sequence of nodes that form the path. This may be a variable sized record. Offset and position of node within each path may be tracked separately.

The node index generator 206 may then generate the node path index file 210 and the node index file 212. This process may utilize the node position and the node offset values generated by the path index generator. There may be an entry for each occurrence of a node in the node path index file 210. Each entry may have two components: path offset and the position of the node within the path. The node index file 212 may be an index into the node path index file 210 for each node.

An example is provided for illustrative purposes. This example is not intended to be limiting. Assume that the following distinct paths are in the raw input data set:

- 1:5:10:2 2
- 1:5:9:10 1
- 1:5:10:5 1
- 1:8:9:10:11:8 10
- 2:10:11:12 10
- 2:11:12 5

where each line indicates one distinct path having two components: the nodes in the path and the payload (frequency). Here, n₁:n₂:n₃ . . . indicates the path. Each n_(i) is the encoded integer value of the node. The number after the path is the frequency (the number of instances where the path occurs in the overall data set).

If there are three output buckets, then each bucket may get two paths. It should be noted that in real-world situations the paths are more likely to be on the order of 500 million with each path containing up to 600 nodes, but for obvious reasons such a complex example will not be described in this document.

The first bucket may contain:

- 1:5:10:2 2
- 1:5:9:10 1

The second bucket may contain:

- 1:5:10:5 1
- 1:8:9:10:11:8 10

The third bucket may contain:

- 2:10:11:12 10
- 2:11:12 5
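The even split shown in this example may be sketched as follows. The function name split_into_buckets is hypothetical, and assigning consecutive runs of paths to each bucket is an assumption chosen only because it reproduces the three buckets listed above; any other even assignment would serve equally well.

```python
def split_into_buckets(paths, bucket_count):
    """Split paths into bucket_count consecutive groups of near-equal size."""
    size, extra = divmod(len(paths), bucket_count)
    buckets, start = [], 0
    for i in range(bucket_count):
        # The first `extra` buckets take one additional path each.
        end = start + size + (1 if i < extra else 0)
        buckets.append(paths[start:end])
        start = end
    return buckets


# The six example paths as (path, frequency) pairs.
EXAMPLE_PATHS = [
    ("1:5:10:2", 2), ("1:5:9:10", 1), ("1:5:10:5", 1),
    ("1:8:9:10:11:8", 10), ("2:10:11:12", 10), ("2:11:12", 5),
]
```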

FIG. 3 is a diagram illustrating a path file, node path index file, and node index file for the first bucket in the above example. Here, the path file 300 for the first bucket contains two paths. Path file 300 begins with the sequence 0 4 2, which corresponds to the offset, length, and frequency, respectively, of the first path. Then the path file 300 contains the first path itself (1 5 10 2). Then the path file 300 contains the offset, length, and frequency for the second path (28 4 1) followed by the second path (1 5 9 10). Note that the second offset is 28 because the first path record has seven entries. In this example, each entry may be represented using four bytes, thus the second path information begins at the 28th byte. Alternatively, the offset may be based upon the number of the corresponding entry with respect to other entries, regardless of the size of each entry (e.g., the eighth entry may have an offset of seven).
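The byte layout of this example path file may be sketched with Python's struct module as follows. The function names write_path_file and read_path are hypothetical, and the little-endian byte order is an assumption, since the description specifies only that each entry occupies four bytes.

```python
import struct


def write_path_file(paths):
    """Serialize (nodes, frequency) pairs; return (bytes, list of record offsets)."""
    out = bytearray()
    offsets = []
    for nodes, frequency in paths:
        offset = len(out)  # byte offset at which this record is written
        offsets.append(offset)
        # Record layout: offset, length, frequency, then the nodes (4 bytes each).
        out += struct.pack(f"<{3 + len(nodes)}i",
                           offset, len(nodes), frequency, *nodes)
    return bytes(out), offsets


def read_path(data, offset):
    """Read one variable-sized record starting at the given byte offset."""
    stored_offset, length, frequency = struct.unpack_from("<3i", data, offset)
    nodes = struct.unpack_from(f"<{length}i", data, offset + 12)
    return stored_offset, length, frequency, nodes
```

For the first bucket, the first record holds seven four-byte entries (0 4 2 1 5 10 2), so the second record starts at byte 28, in agreement with the figure.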

The node path index file 302 may then contain information for each of the nodes in this bucket. The paths in this bucket have only 5 total different nodes. These are 1, 2, 5, 9, and 10. Node 1 appears in both paths in the bucket; as such, the node path index file contains two records for node 1. Here, the first record for node 1 contains 0 1, indicating the offset and position, respectively, of the node. That is, this first record indicates that node 1 appears in the path beginning at offset 0 in the path file, in the first position in the path. Likewise, the second record (i.e., 28 1) indicates that node 1 appears in the path beginning at offset 28 in the path file, in the first position in the path. Each record in the node path index file 302 may comprise 8 bytes (four bytes each for the offset and the position).

The node index file 304 may contain information on all the nodes present in the whole data set. This may include nodes that are not present in the bucket. In an alternative embodiment, only nodes present in the bucket are represented in the node index file 304. In this example, however, nodes present in the data set but not present in the bucket have entries stored as all zeros. Each record in the node index file 304 has two components, the first one giving the number of entries for the corresponding node in the node path index file for this bucket, and the second one giving the offset at which records corresponding to the node are available in the node path index file for this bucket. Here, the entry for node 1 indicates that there are two entries in the node path index file corresponding to node 1 and these entries begin at offset 0. Likewise, the entry for node 2 indicates that there is only 1 entry in the node path index file corresponding to node 2 and this entry begins at offset 16.
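The two index files for the first bucket may be derived as in the following sketch. The name build_node_indexes is hypothetical, the indexes are held as Python lists and dictionaries rather than binary files, and ordering the node path index by ascending node identifier is an assumption that happens to agree with the offsets stated above.

```python
RECORD_BYTES = 8  # four bytes each for the path offset and the position


def build_node_indexes(path_records, all_nodes):
    """path_records: (byte offset, nodes) pairs; all_nodes: every node in the data set."""
    occurrences = {}
    for path_offset, nodes in path_records:
        for position, node in enumerate(nodes, start=1):
            occurrences.setdefault(node, []).append((path_offset, position))

    node_path_index = []  # one (path offset, position) record per occurrence
    node_index = {}       # node -> (count, byte offset into node_path_index)
    for node in sorted(all_nodes):
        records = occurrences.get(node, [])
        if records:
            node_index[node] = (len(records), len(node_path_index) * RECORD_BYTES)
            node_path_index.extend(records)
        else:
            # Nodes absent from this bucket are stored as all zeros.
            node_index[node] = (0, 0)
    return node_path_index, node_index
```

With the first bucket's two paths at byte offsets 0 and 28, node 1 receives two records beginning at offset 0 and node 2 receives one record at offset 16, as described.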

Analysis of the path information in order to answer relevant business questions is simplified by use of various embodiments of the present invention. The efficient identification of patterns in path data may be accomplished by first distributing pattern identification among multiple processes, which allows for parallel processing. Then the patterns may be identified and path information aggregated at the partition level. Then the data from all the partitions may be aggregated, and finally the top data based on the payload may be identified. The payload may contain any other information regarding the path. However, in an embodiment of the present invention, the payload holds frequency information (i.e., information regarding the number of times the path appears in the data set). FIG. 4 is a diagram illustrating an architecture for the efficient identification of patterns in path data in accordance with an embodiment of the present invention. Three main components may perform the above-identified processes. These components may include a master 400, a top data identifier 402, and multiple drones 404 a, 404 b.

Referring first to the master 400, this module may act generally to distribute the work among the drones 404 a, 404 b and aggregate the data returned by the drones 404 a, 404 b. More specifically, the master 400 may first encode pattern information to match the format in which the data is stored using a node encoder 406. If the data is stored as binary index files as described above, then the encoding may include transforming the pattern information to a series of integers corresponding to nodes. Mapping information may be stored in fast access encode files 410 and the node encoder 406 may look up the user pattern (e.g., a sequence of web pages) and convert the pattern definition into an integer representation to match the data stored in the binary index files. The master 400 may then distribute the buckets uniformly among the available drones 404 a, 404 b using a work distributor 408. As the input data is partitioned into several buckets, each of the drones 404 a, 404 b gets to process a subset of the buckets.
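The encoding and distribution steps performed by the master may be sketched as follows. The function names, the page names, and the integer mapping are all hypothetical, and round-robin assignment is merely one assumed way of distributing buckets uniformly.

```python
def encode_pattern(pattern, node_ids):
    """Translate a pattern of page names into the integer node ids used on disk."""
    return [node_ids[name] for name in pattern]


def distribute_buckets(bucket_ids, drone_ids):
    """Assign buckets to drones round-robin so each drone gets a near-equal share."""
    assignment = {drone: [] for drone in drone_ids}
    for i, bucket in enumerate(bucket_ids):
        assignment[drone_ids[i % len(drone_ids)]].append(bucket)
    return assignment


# Hypothetical mapping of page names to node ids, for illustration only.
NODE_IDS = {"finance": 5, "entertainment": 9, "sports": 10}
```

For instance, encoding the pattern finance, entertainment, sports under this hypothetical mapping gives the integer sequence 5, 9, 10, and three buckets spread over two drones are assigned two and one, respectively.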

Once the drones 404 a, 404 b return sorted data (described in more detail below), the master 400 may aggregate the sorted data using a data aggregator 412. Although each drone 404 a, 404 b may act on a different data set, since patterns are being identified, it is possible that the same pattern may be returned by different drones. As such, the master 400 may aggregate the payload from all the drones to identify such duplications and handle them accordingly (e.g., aggregate two or more identical patterns to a single pattern having a frequency count). Finally, the master 400 may send the aggregated data to the top data identifier 402.
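The aggregation performed by the data aggregator may be sketched as follows, assuming that each drone returns a list of (pattern, payload) pairs and that the payload is a frequency. The function name aggregate_drone_results is hypothetical.

```python
from collections import Counter


def aggregate_drone_results(drone_results):
    """Merge per-drone (pattern, payload) lists, summing payloads of identical patterns."""
    totals = Counter()
    for results in drone_results:
        for pattern, payload in results:
            totals[tuple(pattern)] += payload
    return dict(totals)
```

If one drone returns a pattern with a payload of 6 and another returns the same pattern with a payload of 2, the merged result holds a single occurrence of the pattern with a payload of 8.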

Referring to the drones 404 a, 404 b, these modules may generally extract requested patterns. These patterns may be specified by users, or may be generated by the drones or other processes, in order to aid in answering questions relevant to users. These patterns may be extracted from specified buckets, and the drones may then aggregate the common data and send the results to the master 400. As such, the drones 404 a, 404 b may have access to the binary index files 414 a, 414 b whereas the master 400 and top data identifier 402 may not.

Specifically, each drone may first identify all the paths that satisfy a given pattern (which may include a specified source, destination, and via nodes, if any). The identification process may work backwards, since the destination node is typically the convergence node and hence will have fewer paths to be considered. Since there may be multiple nodes specified in each of the patterns, the identification process may collect paths, taking into consideration all the nodes in any step. If a constraint is specified to extract paths with certain patterns, each drone may then perform pattern matching among the identified paths. For example, given a pattern where a sequence of nodes is expected to be adjacent to each other or separated by a constant number of nodes in between, the drones may examine identified paths satisfying the pattern and remove paths that do not meet the constraint. Once the paths that have valid patterns have been identified, the desired information may be extracted from those paths and stored in memory. It should be noted that the aforementioned steps performed by each drone may then be repeated for each bucket assigned to the drone. Once this is completed, all the extracted information from each of the buckets may be aggregated so that the payload for the same identified pattern is added together. This aggregated data may then be sorted and sent to the master 400 by each drone 404 a, 404 b.
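The adjacency or constant-separation constraint mentioned above may be checked as in the following sketch. The function name satisfies_gap and its gap parameter are hypothetical; the sketch assumes the positions of the pattern nodes within a path have already been identified.

```python
def satisfies_gap(positions, gap=0):
    """True if consecutive pattern nodes are separated by exactly `gap` other nodes.

    positions: 1-based positions of the pattern nodes in a path, in pattern order.
    gap=0 means the nodes must be adjacent.
    """
    return all(b - a == gap + 1 for a, b in zip(positions, positions[1:]))
```

Paths whose identified positions fail this check would be removed before the payload is aggregated.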

Referring to the top data identifier 402, this module may generally be instructed to fetch the top N results (patterns and associated payload) out of all the identified results. This module may also produce summary data (e.g., the total number of patterns identified for the specified pattern and their total payload) in addition to the top N results. This module may get the aggregated data from the master.

Specifically, the top data identifier 402 may first parse the input data and extract the pattern and its associated payload. Then it may store the data associated with the top payload, eliminating the insignificant data by keeping only the summary (total distinct data sets and their total payload). Then a summary followed by the top data and their payload may be outputted. Here, the top data (patterns) may be decoded (from, e.g., integer representation to web page identification) with a node decoder 416 using the stored mapping information from the data access decode files 418.

FIG. 5 is a diagram illustrating an example of how patterns are extracted using a drone in accordance with an embodiment of the present invention. For simplicity, only one bucket of data with three paths is considered in this example. The paths are labeled as 500 in FIG. 5. Given these three paths, the binary index files for these paths are labeled as 502 in FIG. 5. Assume that the drone is given the task of extracting the patterns that begin with node 5, go through node 9, and end with node 10.
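The selection of the top N results together with the summary may be sketched as follows. The function name top_results and the dictionary keys are hypothetical, and decoding of the integer patterns back to web page identifiers is omitted.

```python
import heapq


def top_results(aggregated, n):
    """Return (summary, top-n list) from a {pattern: payload} mapping."""
    summary = {
        "distinct_patterns": len(aggregated),
        "total_payload": sum(aggregated.values()),
    }
    # Keep only the n entries with the largest payload.
    top = heapq.nlargest(n, aggregated.items(), key=lambda item: item[1])
    return summary, top
```

Only the top N entries are retained in full; the remaining patterns contribute solely to the summary counts.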

The drone may first identify the paths with node 10 and store the corresponding end positions in the paths. This may be achieved by locating the information for node 10 in the node index file. From this it can be seen that node 10 occurs 3 times and the information about the position of the node in the corresponding paths is at offset 72 in the node path index file. From the node path index file, it can be seen that there is a path containing node 10 at position 3 (in path 1, which starts at position 0 in the path file), a second path with node 10 at position 4 (in path 2, which starts at position 28 in the path file), and a third path with node 10 at position 4 (in path 3, which starts at position 56 in the path file). A data structure may be set up as labeled as 504 in FIG. 5, with starting and intermediate (via) positions initialized to invalid (e.g., −1).

For the paths identified in the first step, the drone may then obtain the start positions for node 5. To facilitate this, node 5 may be located in the node index file. Node 5 occurs 3 times. Since all of the relevant paths were identified in the previous step, the start positions for the paths in the data structure 504 may be updated. If there were paths having node 5 for which there are no entries in the data structure 504, then those paths would have been ignored. Additionally, if the position of a start node in a path is more than the end position (i.e., node 5 appears after node 10 in the path), then such paths will also be ignored. The data structure 504 is then updated with the start position information to produce data structure 506.

For the paths identified in the previous steps, the drone may then identify those that contain node 9 in an intermediate position. Once again the node index file may be accessed to determine that node 9 is present in two paths at position 3. Since this position falls in the range between the start position and the end position, the path is considered valid and the data structure 506 is updated to include the intermediate position information to produce data structure 508. Since one of the 3 paths in data structure 506 wound up not containing node 9 in an intermediate position, the data structure 508 still reflects an invalid entry for the intermediate position of this path. It should also be noted that if multiple intermediate nodes are specified as part of the pattern, then this intermediate node inspection step is repeated for each of the specified intermediate nodes.

Given data structure 508, the drone may then proceed to extract the corresponding path data. Since the path beginning at offset 0 contains an invalid entry in the intermediate position, this path will be ignored. The pattern identified as beginning at position 2 and ending at position 4 at offset 28 may then be retrieved, resulting in the pattern “5:9:10”. Likewise, the pattern identified as beginning at position 2 and ending at position 4 at offset 56 may be retrieved, which also results in the pattern 5:9:10. Since the same pattern was obtained from two different paths with different payloads, the drone may then aggregate the payload and stream the pattern back with the aggregated payload. Here, the second path had a payload of 1 and the third path had a payload of 5. Thus, the drone may aggregate this information into a single pattern of 5:9:10 with a payload of 6. If there is a need to perform pattern matching after extraction of data from the path index files (e.g., adjacency checks), the pattern matching may be performed at this time. The drone then sends the extracted patterns to the master, which then performs the aggregation of the payload fields for identical patterns from all the drones. For example, if another drone returned the same pattern (5:9:10) with a payload of 2, the master may aggregate all these identical patterns to result in a payload of 8.
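The drone-side steps of this example (end node first, then start node, then intermediate node, then extraction and aggregation) may be sketched as follows. The function name extract_pattern is hypothetical, the index is built in memory rather than read from binary files, and a single via node is assumed. The third path used in the usage note, 2:5:9:10 with a payload of 5, is an assumed stand-in, since the description gives only its offset, positions, and payload.

```python
INVALID = -1


def extract_pattern(paths, start, via, end):
    """paths: {byte offset: (nodes, payload)}. Returns {pattern: summed payload}."""
    # Occurrence index: node -> [(path offset, 1-based position), ...].
    index = {}
    for offset, (nodes, _) in paths.items():
        for position, node in enumerate(nodes, start=1):
            index.setdefault(node, []).append((offset, position))

    # Step 1: record the end position (last occurrence) for paths containing `end`.
    table = {}
    for offset, position in index.get(end, []):
        row = table.setdefault(
            offset, {"start": INVALID, "via": INVALID, "end": position})
        row["end"] = max(row["end"], position)

    # Step 2: fill in the first start position that precedes the end position.
    for offset, position in index.get(start, []):
        row = table.get(offset)
        if row and row["start"] == INVALID and position < row["end"]:
            row["start"] = position

    # Step 3: fill in a via position lying strictly between start and end.
    for offset, position in index.get(via, []):
        row = table.get(offset)
        if (row and row["via"] == INVALID and row["start"] != INVALID
                and row["start"] < position < row["end"]):
            row["via"] = position

    # Step 4: extract valid patterns and add together payloads of identical ones.
    results = {}
    for offset, row in table.items():
        if INVALID in (row["start"], row["via"]):
            continue  # an invalid start or via position means the path is ignored
        nodes, payload = paths[offset]
        pattern = tuple(nodes[row["start"] - 1:row["end"]])
        results[pattern] = results.get(pattern, 0) + payload
    return results
```

Calling extract_pattern with the paths 1:5:10:2 (offset 0, payload 2), 1:5:9:10 (offset 28, payload 1), and the assumed 2:5:9:10 (offset 56, payload 5), with start node 5, via node 9, and end node 10, ignores the path at offset 0 and yields the single pattern 5:9:10 with a payload of 6.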

FIG. 6 is a flow diagram illustrating a method for identifying path information containing a pattern in accordance with an embodiment of the present invention. The path information may relate to network nodes visited by users of a computer network. The method may be executed at a master process. At 600, the pattern may be encoded in a format matching a format in which the path information is stored. Mapping information relating to the encoding may be stored in a mapping file. At 602, the pattern may be sent to two or more drone processes. The two or more drone processes may be executed by different processors. At 604, path data relating to paths satisfying the pattern may be received from the two or more drone processes along with payload information corresponding to the paths. At 606, the path data received from the two or more drone processes may be aggregated so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path. At 608, the aggregated path data may be transmitted to a top data identification process. The top data identification process may produce summary data and a top number of results from the aggregated path data.

FIG. 7 is a flow diagram illustrating a method for identifying path information containing a pattern in accordance with another embodiment of the present invention. The path information may relate to network nodes visited by users of a computer network. The method may be executed at a drone process. At 700, the pattern may be received from a master process. At 702, all paths in the path information that satisfy the pattern may be identified. FIG. 8 is a flow diagram illustrating 702 of FIG. 7 in more detail. At 800, all paths in the path information that contain a first node in the pattern may be identified. At 802, a data structure may be created having, for each of the paths that contain the first node, an identification of a position in a path file of an offset to where path information relating to the path begins, an identification of a position of the first node in the pattern, an identification of a position of a second node in the pattern, and an identification of a third node in the pattern. It should be noted that this embodiment assumes a three node pattern. However, embodiments are possible with any number of different nodes. Identifications of the positions of any nodes beyond the first node may be initialized to invalid (e.g., −1). At 804, all paths in the data structure that contain the first and second nodes in the pattern may be identified. At 806, the data structure may be updated to fill in identifications of positions of the second node for paths in the data structure that contain the first and second nodes. At 808, all paths in the data structure that contain the first, second, and third nodes in the pattern may be identified. At 810, the data structure may be updated to fill in identifications of positions of the third node for paths in the data structure that contain the first, second, and third nodes.

Referring back to FIG. 7, at 704, paths corresponding to any paths in the data structure that contain valid position information for the first, second, and third nodes may be extracted from the path file. This may include only paths that have a position for the second node less than a position for the third node, and a position for the first node less than a position for the second node. At 706, pattern matching may be performed on the paths that satisfy the pattern to identify patterns that satisfy additional constraints. At 708, the paths that satisfy the pattern may be aggregated so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path. At 710, the paths that satisfy the pattern may be sent to the master process.

FIG. 9 is a block diagram illustrating an apparatus for identifying path information containing a pattern in accordance with an embodiment of the present invention. The path information may relate to network nodes visited by users of a computer network. The apparatus may be a master process, such as 400 of FIG. 4. A pattern encoder 900 may encode the pattern in a format matching a format in which the path information is stored. Mapping information relating to the encoding may be stored in a mapping file. A two or more drone process pattern sender 902 coupled to the pattern encoder 900 may send the pattern to two or more drone processes. The two or more drone processes may be executed by different processors. A satisfied pattern path data receiver 904 may receive path data relating to paths satisfying the pattern from the two or more drone processes along with payload information corresponding to the paths. A path data aggregator 906 coupled to the satisfied pattern path data receiver 904 may aggregate the path data received from the two or more drone processes so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path. An aggregated path data top data identification process transmitter 908 coupled to the path data aggregator 906 may transmit the aggregated path data to a top data identification process. The top data identification process may produce summary data and a top number of results from the aggregated path data.

FIG. 10 is a flow diagram illustrating an apparatus for identifying path information containing a pattern in accordance with another embodiment of the present invention. The path information may relate to network nodes visited by users of a computer network. The apparatus may be a drone process, such as 404a or 404b of FIG. 4. A master process pattern receiver 1000 may receive the pattern from a master process. A satisfied pattern path information identifier 1002 coupled to the master process pattern receiver 1000 may identify all paths in the path information that satisfy the pattern. FIG. 11 is a block diagram illustrating 1002 of FIG. 10 in more detail. A first node pattern path information identifier 1100 may identify all paths in the path information that contain a first node in the pattern. A path pattern data structure creator 1102 coupled to the first node pattern path information identifier 1100 may create a data structure having, for each of the paths that contain the first node, an identification of a position in a path file of an offset to where path information relating to the path begins, an identification of a position of the first node in the pattern, an identification of a position of a second node in the pattern, and an identification of a position of a third node in the pattern. It should be noted that this embodiment assumes a three node pattern. However, embodiments are possible with any number of different nodes. Identifications of the positions of any nodes beyond the first node may be initialized to invalid (e.g., −1). A first and second node pattern path data structure identifier 1104 coupled to the path pattern data structure creator 1102 may identify all paths in the data structure that contain the first and second nodes in the pattern. A second node position data structure updater 1106 coupled to the first and second node pattern path data structure identifier 1104 may update the data structure to fill in identifications of positions of the second node for paths in the data structure that contain the first and second nodes. A first, second, and third node pattern path data structure identifier 1108 coupled to the second node position data structure updater 1106 may identify all paths in the data structure that contain the first, second, and third nodes in the pattern. A third node position data structure updater 1110 coupled to the first, second, and third node pattern path data structure identifier 1108 may update the data structure to fill in identifications of positions of the third node for paths in the data structure that contain the first, second, and third nodes.
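
The staged filling of the data structure described for FIG. 11 can be illustrated with a short Python sketch. This is an illustration under assumptions rather than the disclosed implementation: PathEntry and build_entries are hypothetical names, the per-node lookup is modeled as an in-memory dictionary mapping a node id to {path offset: position of that node in the path}, and each node is assumed to occur at most once per path.

```python
from dataclasses import dataclass

INVALID = -1  # marker for a node position that has not been filled in


@dataclass
class PathEntry:
    offset: int            # offset in the path file where the path begins
    pos1: int              # position of the first pattern node in the path
    pos2: int = INVALID    # position of the second node, initially invalid
    pos3: int = INVALID    # position of the third node, initially invalid


def build_entries(index, pattern):
    """Build and progressively fill entries for a three node pattern.

    `index` is a hypothetical lookup: node id -> {path offset: position}.
    """
    first, second, third = pattern
    # Step 1: one entry per path that contains the first node.
    entries = {
        offset: PathEntry(offset, pos)
        for offset, pos in index.get(first, {}).items()
    }
    # Step 2: fill in the second node for paths that also contain it.
    for offset, pos in index.get(second, {}).items():
        if offset in entries:
            entries[offset].pos2 = pos
    # Step 3: fill in the third node only where the first two are present.
    for offset, pos in index.get(third, {}).items():
        entry = entries.get(offset)
        if entry is not None and entry.pos2 != INVALID:
            entry.pos3 = pos
    return entries
```

Entries whose second or third position remains INVALID after the last step correspond to paths that do not contain every node of the pattern.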

Referring back to FIG. 10, a pattern matching performer 1004 coupled to the satisfied pattern path information identifier 1002 may perform pattern matching on the paths that satisfy the pattern to identify patterns that satisfy additional constraints. A valid path extractor 1006 coupled to the pattern matching performer 1004 may extract paths corresponding to any paths in the data structure that contain valid position information for the first, second, and third nodes from the path file. This may include only paths that have a position for the second node less than a position for the third node, and a position for the first node less than a position for the second node. A satisfied pattern path aggregator 1008 coupled to the valid path extractor 1006 may aggregate the paths that satisfy the pattern so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path. A master process satisfied pattern path sender 1010 coupled to the satisfied pattern path aggregator 1008 may send the paths that satisfy the pattern to the master process.
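
The extraction and aggregation steps may be sketched as follows. The Python below is an illustration only, with several assumptions that are not stated in the specification: extract_valid_paths is a hypothetical name, each data structure entry is modeled as a tuple (offset, pos1, pos2, pos3), and the path file is assumed to hold one path per line as space-separated node ids, with the offset being the byte position at which that line begins.

```python
import io
from collections import Counter

INVALID = -1  # marker for a node position that was never filled in


def extract_valid_paths(entries, path_file):
    """Read matching paths from the path file and merge identical ones.

    `entries` holds (offset, pos1, pos2, pos3) tuples. `path_file` is a
    binary file-like object in the assumed one-path-per-line layout.
    """
    counts = Counter()
    for offset, pos1, pos2, pos3 in entries:
        # Skip entries that do not contain every node of the pattern.
        if INVALID in (pos1, pos2, pos3):
            continue
        # Keep only paths that visit the nodes in pattern order.
        if not (pos1 < pos2 < pos3):
            continue
        path_file.seek(offset)
        line = path_file.readline().decode("ascii")
        path = tuple(int(token) for token in line.split())
        # Identical paths are reduced to one key with an occurrence count.
        counts[path] += 1
    return counts
```

The resulting mapping of unique paths to counts is what a drone in this sketch would send back to the master process.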

It should also be noted that the present invention may be implemented on any computing platform and in any network topology in which search categorization is a useful functionality. For example and as illustrated in FIG. 12, implementations are contemplated in which the node path files described herein are employed in a network containing personal computers 1202, media computing platforms 1203 (e.g., cable and satellite set top boxes with navigation and recording capabilities (e.g., Tivo)), handheld computing devices (e.g., PDAs) 1204, cell phones 1206, or any other type of portable communication platform. Users of these devices may navigate the network, and path information may be collected by server 1208. Server 1208 may then utilize the various techniques described above to store and access path information in an efficient manner. Applications may be resident on such devices, e.g., as part of a browser or other application, or be served up from a remote site, e.g., in a Web page (represented by server 1208 and data store 1210). The invention may also be practiced in a wide variety of network environments (represented by network 1212), e.g., TCP/IP-based networks, telecommunications networks, wireless networks, etc.

While the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may be made without departing from the spirit or scope of the invention. In addition, although various advantages, aspects, and objects of the present invention have been discussed herein with reference to various embodiments, it will be understood that the scope of the invention should not be limited by reference to such advantages, aspects, and objects. Rather, the scope of the invention should be determined with reference to the appended claims.

1. A method for identifying path information containing a pattern, wherein the path information relates to network nodes visited by users of a computer network, the method comprising: sending the pattern of path information to two or more drone processes; receiving, from the two or more drone processes, path data containing paths satisfying the pattern along with payload information corresponding to the paths; aggregating the path data received from the two or more drone processes so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path; and transmitting the aggregated path data to a top data identification process.
2. The method of claim 1, wherein the two or more drone processes are executed by different processors.
3. The method of claim 1, further comprising: encoding the pattern in a format matching a format in which the path information is stored.
4. The method of claim 3, wherein mapping information relating to the encoding is stored in a mapping file separate from the path data.
5. The method of claim 1, wherein the top data identification process produces summary data containing a summary of the path data and a top number of results from the aggregated path data.
6. A method for identifying path information containing a pattern, wherein the path information relates to network nodes visited by users of a computer network, the method executed at a drone process and comprising: receiving the pattern from a master process; identifying all paths in the path information that satisfy the pattern; and sending the paths that satisfy the pattern to the master process.
7. The method of claim 6, further comprising: aggregating the paths that satisfy the pattern so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path.
8. The method of claim 6, further comprising: performing pattern matching on the paths that satisfy the pattern to identify patterns that satisfy additional constraints.
9. The method of claim 6, wherein the identifying includes: identifying all paths in the path information that contain a first node in the pattern; creating a data structure having, for each of the paths that contain the first node, an identification of a position in a path file of an offset to where path information relating to the path begins, an identification of a position of the first node in the pattern, and an identification of a position of the second node in the pattern, wherein the identifications of the positions of the second node are initialized to invalid; identifying all paths in the data structure that contain the first and second nodes in the pattern; and updating the data structure to fill in identifications of positions of the second node for paths in the data structure that contain the first and second nodes.
10. The method of claim 9, further comprising: extracting paths from the path file corresponding to any paths in the data structure that contain valid position information for both the first and second nodes.
11. The method of claim 9, further comprising: extracting paths from the path file corresponding to any paths in the data structure that contain valid position information for both the first and second nodes and that also contain a position for the first node that is less than a position for the second node.
12. The method of claim 9, wherein the data structure further includes an identification of a position of a third node in the pattern and wherein the method further comprises: identifying all paths in the data structure that contain the first, second, and third nodes in the pattern; and updating the data structure to fill in identifications of positions of the third node for paths in the data structure that contain the first, second, and third nodes.
13. A system for identifying path information containing a pattern, wherein the path information relates to network nodes visited by users of a computer network, the system comprising: a master process; two or more drone processes; and a top path identification process; wherein the master process is configured to send pattern information to the two or more drone processes, receive aggregated path data from the two or more drone processes, aggregate the aggregated path data from the two or more drone processes, and transmit the results of the aggregation to the top path identification process; wherein the two or more drone processes are each configured to identify paths in different sets of path information that contain the pattern, aggregate the identified paths, and return the aggregated path data to the master process; and wherein the top path identification process is configured to summarize and output a top number of results from the results transmitted from the master process.
14. An apparatus for identifying path information containing a pattern, wherein the path information relates to network nodes visited by users of a computer network, the apparatus comprising: a two or more drone process pattern sender; a satisfied pattern path data receiver; a path data aggregator coupled to the satisfied pattern path data receiver; and an aggregated path data top data identification process transmitter coupled to the path data aggregator.
15. A drone apparatus for identifying path information containing a pattern, wherein the path information relates to network nodes visited by users of a computer network, the drone apparatus comprising: a master process pattern receiver; a satisfied pattern path information identifier coupled to the master process pattern receiver; and a master process satisfied pattern path data sender coupled to the satisfied pattern path information identifier.
16. An apparatus for identifying path information containing a pattern, wherein the path information relates to network nodes visited by users of a computer network, the apparatus comprising: means for sending the pattern of path information to two or more drone processes; means for receiving, from the two or more drone processes, path data containing paths satisfying the pattern along with payload information corresponding to the paths; means for aggregating the path data received from the two or more drone processes so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path; and transmitting the aggregated path data to a top data identification process.
17. A drone apparatus for identifying path information containing a pattern, wherein the path information relates to network nodes visited by users of a computer network, the drone apparatus comprising: means for receiving the pattern from a master process; means for identifying all paths in the path information that satisfy the pattern; and means for sending the paths that satisfy the pattern to the master process.
18. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform a method for identifying path information containing a pattern, wherein the path information relates to network nodes visited by users of a computer network, the method comprising: sending the pattern of path information to two or more drone processes; receiving, from the two or more drone processes, path data containing paths satisfying the pattern along with payload information corresponding to the paths; aggregating the path data received from the two or more drone processes so that two or more identical paths appearing in the path data are reduced to a single occurrence of a path; and transmitting the aggregated path data to a top data identification process.
19. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform a method for identifying path information containing a pattern, wherein the path information relates to network nodes visited by users of a computer network, the method executed at a drone process and comprising: receiving the pattern from a master process; identifying all paths in the path information that satisfy the pattern; and sending the paths that satisfy the pattern to the master process.